
hands-on-ml-with-sklearn-and-tf (Aurelien Geron): Exercise Solutions

This article collects the end-of-chapter exercise solutions for Hands-On Machine Learning with Scikit-Learn and TensorFlow!

The exercises and answers are kept separate to encourage study and reflection: think independently and learn actively instead of memorizing the answers! The first half of this article is a translated version of the solutions; if you spot any problems in it, feel free to point them out in the comments. The original English text follows.

Resources for Hands-On Machine Learning with Scikit-Learn and TensorFlow

Chapter 1: The Machine Learning Landscape

Exercises

In this chapter we covered some of the most important concepts in Machine Learning. In the next chapters we will dive deeper and write more code, but before moving on, make sure you can answer the following questions:

  1. How would you define Machine Learning?
  2. Can you name four types of problems where it shines?
  3. What is a labeled training set?
  4. What are the two most common supervised tasks?
  5. Can you name four common unsupervised tasks?
  6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
  7. What type of algorithm would you use to segment your customers into multiple groups?
  8. Is spam detection a supervised learning problem or an unsupervised learning problem?
  9. What is an online learning system?
  10. What is out-of-core learning?
  11. What type of learning algorithm relies on a similarity measure to make predictions?
  12. What is the difference between a model parameter and a learning algorithm's hyperparameter?
  13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
  14. What are the main challenges in Machine Learning?
  15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
  16. What is a test set and why would you want to use it?
  17. What is the purpose of a validation set?
  18. What can go wrong if you tune hyperparameters using the test set?
  19. What is cross-validation and why would you prefer it to a validation set?

Exercise Solutions

  1. Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
  2. Machine Learning is great for:
  • complex problems for which we have no algorithmic solution;
  • replacing long lists of hand-tuned rules;
  • building systems that adapt to fluctuating environments;
  • and finally, helping humans learn (e.g., data mining).
    Tip: don't force every problem into Machine Learning; for example, building a web page, or problems that already have efficient classical algorithms (such as checking whether a graph is connected), are better handled without it.
  3. A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
  4. The two most common supervised tasks are regression and classification.
  5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
  6. Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semi-supervised learning problem, but it would be less natural.
  7. If you don't know how to define the groups, you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
  8. Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their labels (spam or not spam).
  9. An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to changing data and to autonomous systems, and of training on very large quantities of data.
  10. Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
  11. An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
  12. A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
  13. Model-based learning algorithms search for optimal values of the model parameters, such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm (see the sketch after this list).
  14. Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
  15. If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.
  16. A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
  17. A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
  18. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
  19. Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.
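
To make answer 13 concrete, here is a minimal scikit-learn sketch (my illustration, not code from the book) of model-based learning: a LinearRegression searches for the parameter values that minimize the MSE cost function on the training data, then predicts by plugging a new instance's features into the learned function. The tiny dataset is made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): one feature with a roughly linear relationship
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

model = LinearRegression()   # a model-based learning algorithm
model.fit(X, y)              # searches for the parameter values (slope, intercept)
                             # that minimize the MSE cost function on the training set

print(model.coef_, model.intercept_)  # the learned model parameters
print(model.predict([[6.0]]))         # prediction = learned function applied to new features
```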

Exercises

In this chapter we have covered some of the most important concepts in Machine Learning. In the next chapters we will dive deeper and write more code, but before we do, make sure you know how to answer the following questions:

  1. How would you define Machine Learning?
  2. Can you name four types of problems where it shines?
  3. What is a labeled training set?
  4. What are the two most common supervised tasks?
  5. Can you name four common unsupervised tasks?
  6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
  7. What type of algorithm would you use to segment your customers into multiple groups?
  8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
  9. What is an online learning system?
  10. What is out-of-core learning?
  11. What type of learning algorithm relies on a similarity measure to make predictions?
  12. What is the difference between a model parameter and a learning algorithm’s hyperparameter?
  13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
  14. Can you name four of the main challenges in Machine Learning?
  15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
  16. What is a test set and why would you want to use it?
  17. What is the purpose of a validation set?
  18. What can go wrong if you tune hyperparameters using the test set?
  19. What is cross-validation and why would you prefer it to a validation set?

Exercise Solutions

  1. Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
  2. Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
  3. A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
  4. The two most common supervised tasks are regression and classification.
  5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
  6. Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semisupervised learning problem, but it would be less natural.
  7. If you don’t know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
  8. Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their label (spam or not spam).
  9. An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.
  10. Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
  11. An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
  12. A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
  13. Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance’s features into the model’s prediction function, using the parameter values found by the learning algorithm.
  14. Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
  15. If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.
  16. A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
  17. A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
  18. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
  19. Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data (see the sketch below).
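
As a minimal illustration of answers 16 to 19 (my own sketch with a synthetic dataset, not code from the book): hold out a test set once, use k-fold cross-validation on the remaining data to compare models and tune hyperparameters, and only touch the test set at the very end to estimate the generalization error.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels, for illustration only

# Keep a test set aside to estimate the generalization error at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare models with cross-validation instead of a fixed validation set
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=42)):
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(type(model).__name__, scores.mean())

# Only after choosing a model (and its hyperparameters) do we touch the test set
best = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", best.score(X_test, y_test))
```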

Chapter 13: Convolutional Neural Networks

Exercises

  1. What are the advantages of a CNN over a fully connected DNN for image classification?
  2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?
  3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?
  4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?
  5. When would you want to add a local response normalization layer?
  6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet and ResNet?

Exercise Solutions

  1. These are the main advantages of a CNN over a fully connected DNN for image classification:

  • Because consecutive layers are only partially connected and because it heavily reuses its weights, a CNN has many fewer parameters than a fully connected DNN, which makes it much faster to train, reduces the risk of overfitting, and requires much less training data.
  • When a CNN has learned a kernel that can detect a particular feature, it can detect that feature anywhere in the image. In contrast, when a DNN learns a feature in one location, it can detect it only in that particular location. Since images typically have very repetitive features, CNNs are able to generalize much better than DNNs for image processing tasks such as classification, using fewer training examples.
  • Finally, a DNN has no prior knowledge of how pixels are organized; it does not know that nearby pixels are close. A CNN's architecture embeds this prior knowledge. Lower layers typically identify features in small areas of the images, while higher layers combine the lower-level features into larger features. This works well with most natural images, giving CNNs a decisive head start compared to DNNs.
  2. Let's compute how many parameters the CNN has. Since its first convolutional layer has 3 × 3 kernels and the input has three channels (red, green, and blue), each feature map has 3 × 3 × 3 weights, plus a bias term. That's 28 parameters per feature map. Since this first convolutional layer has 100 feature maps, it has a total of 2,800 parameters. The second convolutional layer has 3 × 3 kernels, and its input is the set of 100 feature maps of the previous layer, so each feature map has 3 × 3 × 100 = 900 weights, plus a bias term. Since it has 200 feature maps, this layer has 901 × 200 = 180,200 parameters. Finally, the third and last convolutional layer also has 3 × 3 kernels, and its input is the set of 200 feature maps of the previous layer, so each feature map has 3 × 3 × 200 = 1,800 weights, plus a bias term. Since it has 400 feature maps, this layer has a total of 1,801 × 400 = 720,400 parameters. All in all, the CNN has 2,800 + 180,200 + 720,400 = 903,400 parameters.
    Now let's compute how much RAM this neural network will require (at least) when making a prediction for a single instance. First let's compute the feature map size for each layer. Since we are using a stride of 2 and SAME padding, the horizontal and vertical size of the feature maps are divided by 2 at each layer (rounding up if necessary), so as the input images are 200 × 300 pixels, the first layer's feature maps are 100 × 150, the second layer's are 50 × 75, and the third layer's are 25 × 38. Since 32 bits is 4 bytes and the first convolutional layer has 100 feature maps, this first layer takes up 4 × 100 × 150 × 100 = 6 million bytes (about 5.7 MB, given that 1 MB = 1,024 KB and 1 KB = 1,024 bytes). The second layer takes up 4 × 50 × 75 × 200 = 3 million bytes (about 2.9 MB). Finally, the third layer takes up 4 × 25 × 38 × 400 = 1,520,000 bytes (about 1.4 MB). However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 3 = 9 million bytes (about 8.6 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer has not been released yet). We also need to add the memory occupied by the CNN's parameters: we computed earlier that it has 903,400 parameters, each using 4 bytes, which adds 3,613,600 bytes (about 3.4 MB). The total RAM required is (at least) 12,613,600 bytes (about 12.0 MB).
    Lastly, let's compute the minimum amount of RAM required when training the CNN on a mini-batch of 50 images. During training TensorFlow uses backpropagation, which requires keeping all values computed during the forward pass until the reverse pass begins, so we must compute the total RAM required by all layers for a single instance and multiply that by 50. From here on let's count in megabytes rather than bytes. We computed before that the three layers require respectively 5.7, 2.9, and 1.4 MB per instance, i.e., 10.0 MB per instance, so for 50 instances the total is 500 MB. Add to that the RAM required by the input images, which is 50 × 4 × 200 × 300 × 3 = 36 million bytes (about 34.3 MB), plus the RAM required for the model parameters, about 3.4 MB (computed earlier), plus some RAM for the gradients (which we will neglect, since they can be released gradually as backpropagation works its way down the layers during the reverse pass). We are up to a total of roughly 500.0 + 34.3 + 3.4 = 537.7 MB, and that is really an optimistic bare minimum. (The sketch after this list recomputes these figures in a few lines of Python.)

  3. If your GPU runs out of memory while training a CNN, here are five things you could try to solve the problem (other than purchasing a GPU with more RAM):

  • Reduce the mini-batch size.
  • Reduce dimensionality using a larger stride in one or more layers.
  • Remove one or more layers.
  • Use 16-bit floats instead of 32-bit floats.
  • Distribute the CNN across multiple devices.
  4. A max pooling layer has no parameters at all, whereas a convolutional layer has quite a few (see the previous questions).

  5. A local response normalization layer makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps, which encourages different feature maps to specialize and pushes them apart, forcing them to explore a wider range of features. It is typically used in the lower layers, to have a larger pool of low-level features that the upper layers can build upon.

  6. The main innovations in AlexNet compared to LeNet-5 are (1) it is much larger and deeper, and (2) it stacks convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The main innovation in GoogLeNet is the introduction of inception modules, which make it possible to have a much deeper net than previous CNN architectures, with fewer parameters. Finally, ResNet's main innovation is the introduction of skip connections, which make it possible to go well beyond 100 layers. Arguably, its simplicity and consistency are also rather innovative.
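
The arithmetic in answer 2 can be checked with a few lines of plain Python. The sketch below recomputes the parameter count and the per-layer activation sizes for the three-layer CNN described in the exercise (3 × 3 kernels, stride 2, SAME padding, 100/200/400 feature maps, 200 × 300 RGB input).

```python
import math

kernel = 3
maps = [100, 200, 400]
in_channels = 3
h, w = 200, 300

total_params = 0
for n_maps in maps:
    # weights per feature map = kernel*kernel*in_channels, plus one bias per map
    params = (kernel * kernel * in_channels + 1) * n_maps
    total_params += params
    # stride 2 with SAME padding: spatial dimensions are halved (rounded up)
    h, w = math.ceil(h / 2), math.ceil(w / 2)
    activation_bytes = 4 * h * w * n_maps          # 32-bit floats = 4 bytes
    print(f"{n_maps} maps: {params:,} params, {h}x{w} output, {activation_bytes:,} bytes")
    in_channels = n_maps

print("total parameters:", total_params)           # 903,400
```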


Exercises

  1. What are the advantages of a CNN over a fully connected DNN for image classification?
  2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?
  3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?
  4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?
  5. When would you want to add a local response normalization layer?
  6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet and ResNet?

Exercise Solutions

  1. These are the main advantages of a CNN over a fully connected DNN for image classification:
  • Because consecutive layers are only partially connected and because it heavily reuses its weights, a CNN has many fewer parameters than a fully connected DNN, which makes it much faster to train, reduces the risk of overfitting, and requires much less training data.
  • When a CNN has learned a kernel that can detect a particular feature, it can detect that feature anywhere on the image. In contrast, when a DNN learns a feature in one location, it can detect it only in that particular location. Since images typically have very repetitive features, CNNs are able to generalize much better than DNNs for image processing tasks such as classification, using fewer training examples.
  • Finally, a DNN has no prior knowledge of how pixels are organized; it does not know that nearby pixels are close. A CNN’s architecture embeds this prior knowledge. Lower layers typically identify features in small areas of the images, while higher layers combine the lower-level features into larger features. This works well with most natural images, giving CNNs a decisive head start compared to DNNs.
  2. Let’s compute how many parameters the CNN has. Since its first convolutional layer has 3 × 3 kernels, and the input has three channels (red, green, and blue), then each feature map has 3 × 3 × 3 weights, plus a bias term. That’s 28 parameters per feature map. Since this first convolutional layer has 100 feature maps, it has a total of 2,800 parameters. The second convolutional layer has 3 × 3 kernels, and its input is the set of 100 feature maps of the previous layer, so each feature map has 3 × 3 × 100 = 900 weights, plus a bias term. Since it has 200 feature maps, this layer has 901 × 200 = 180,200 parameters. Finally, the third and last convolutional layer also has 3 × 3 kernels, and its input is the set of 200 feature maps of the previous layer, so each feature map has 3 × 3 × 200 = 1,800 weights, plus a bias term. Since it has 400 feature maps, this layer has a total of 1,801 × 400 = 720,400 parameters. All in all, the CNN has 2,800 + 180,200 + 720,400 = 903,400 parameters.
    Now let’s compute how much RAM this neural network will require (at least) when making a prediction for a single instance. First let’s compute the feature map size for each layer. Since we are using a stride of 2 and SAME padding, the horizontal and vertical size of the feature maps are divided by 2 at each layer (rounding up if necessary), so as the input channels are 200 × 300 pixels, the first layer’s feature maps are 100 × 150, the second layer’s feature maps are 50 × 75, and the third layer’s feature maps are 25 × 38. Since 32 bits is 4 bytes and the first convolutional layer has 100 feature maps, this first layer takes up 4 × 100 × 150 × 100 = 6 million bytes (about 5.7 MB, considering that 1 MB = 1,024 KB and 1 KB = 1,024 bytes). The second layer takes up 4 × 50 × 75 × 200 = 3 million bytes (about 2.9 MB). Finally, the third layer takes up 4 × 25 × 38 × 400 = 1,520,000 bytes (about 1.4 MB). However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 3 = 9 million bytes (about 8.6 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer is not released yet). But wait, you also need to add the memory occupied by the CNN’s parameters. We computed earlier that it has 903,400 parameters, each using up 4 bytes, so this adds 3,613,600 bytes (about 3.4 MB). The total RAM required is (at least) 12,613,600 bytes (about 12.0 MB).
    Lastly, let’s compute the minimum amount of RAM required when training the CNN on a mini-batch of 50 images. During training TensorFlow uses backpropagation, which requires keeping all values computed during the forward pass until the reverse pass begins. So we must compute the total RAM required by all layers for a single instance and multiply that by 50! At that point let’s start counting in megabytes rather than bytes. We computed before that the three layers require respectively 5.7, 2.9, and 1.4 MB for each instance. That’s a total of 10.0 MB per instance. So for 50 instances the total RAM is 500 MB. Add to that the RAM required by the input images, which is 50 × 4 × 200 × 300 × 3 = 36 million bytes (about 34.3 MB), plus the RAM required for the model parameters, which is about 3.4 MB (computed earlier), plus some RAM for the gradients (we will neglect them since they can be released gradually as backpropagation goes down the layers during the reverse pass). We are up to a total of roughly 500.0 + 34.3 + 3.4 = 537.7 MB. And that’s really an optimistic bare minimum. (A code sketch after this list rebuilds this network and verifies the parameter count.)

  3. If your GPU runs out of memory while training a CNN, here are five things you could try to solve the problem (other than purchasing a GPU with more RAM):

  • Reduce the mini-batch size.
  • Reduce dimensionality using a larger stride in one or more layers.
  • Remove one or more layers.
  • Use 16-bit floats instead of 32-bit floats.
  • Distribute the CNN across multiple devices.
  4. A max pooling layer has no parameters at all, whereas a convolutional layer has quite a few (see the previous questions).

  5. A local response normalization layer makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps, which encourages different feature maps to specialize and pushes them apart, forcing them to explore a wider range of features. It is typically used in the lower layers to have a larger pool of low-level features that the upper layers can build upon.

  6. The main innovations in AlexNet compared to LeNet-5 are (1) it is much larger and deeper, and (2) it stacks convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The main innovation in GoogLeNet is the introduction of inception modules, which make it possible to have a much deeper net than previous CNN architectures, with fewer parameters. Finally, ResNet’s main innovation is the introduction of skip connections, which make it possible to go well beyond 100 layers. Arguably, its simplicity and consistency are also rather innovative.
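
If you prefer to verify the parameter count with a framework, here is a hedged sketch using tf.keras (an assumption on my part; this edition of the book builds CNNs with the lower-level TensorFlow API instead). model.summary() should report 2,800 + 180,200 + 720,400 = 903,400 trainable parameters for the three convolutional layers.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(100, 3, strides=2, padding="same", activation="relu",
                           input_shape=(200, 300, 3)),                          # 2,800 params
    tf.keras.layers.Conv2D(200, 3, strides=2, padding="same", activation="relu"),  # 180,200 params
    tf.keras.layers.Conv2D(400, 3, strides=2, padding="same", activation="relu"),  # 720,400 params
])
model.summary()   # total trainable parameters: 903,400
```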

Chapter 14: Recurrent Neural Networks

Exercises

  1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?
  2. Why do people use encoder–decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?
  3. How could you combine a convolutional neural network with an RNN to classify videos?
  4. What are the advantages of building an RNN using dynamic_rnn() rather than static_rnn()?
  5. How can you deal with variable-length input sequences? What about variable-length output sequences?
  6. What is a common way to distribute training and execution of a deep RNN across multiple GPUs?

Exercise Solutions

  1. Here are a few RNN applications:
  • For a sequence-to-sequence RNN: predicting the weather (or any other time series), machine translation (using an encoder–decoder architecture), video captioning, speech to text, music generation (or other sequence generation), identifying the chords of a song.
  • For a sequence-to-vector RNN: classifying music samples by music genre, analyzing the sentiment of a book review, predicting what word an aphasic patient is thinking of based on readings from brain implants, predicting the probability that a user will want to watch a movie based on her watch history (this is one of many possible implementations of collaborative filtering).
  • For a vector-to-sequence RNN: image captioning, creating a music playlist based on an embedding of the current artist, generating a melody based on a set of parameters, locating pedestrians in a picture (e.g., a video frame from a self-driving car's camera).
  2. In general, if you translate a sentence one word at a time, the result will be terrible. For example, the French sentence "Je vous en prie" means "You are welcome," but if you translate it one word at a time, you get "I you in pray." Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an encoder–decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).

  3. To classify videos based on the visual content, one possible architecture could be to take (say) one frame per second, run each frame through a convolutional neural network, feed the output of the CNN to a sequence-to-vector RNN, and finally run its output through a softmax layer, giving you all the class probabilities. For training you would just use cross entropy as the cost function. If you wanted to use the audio for classification as well, you could convert every second of audio to a spectrograph, feed this spectrograph to a CNN, and feed the output of this CNN to the RNN (along with the corresponding output of the other CNN).

  4. Building an RNN using dynamic_rnn() rather than static_rnn() offers several advantages:

  • It is based on a while_loop() operation that is able to swap the GPU's memory to the CPU's memory during backpropagation, avoiding out-of-memory errors.
  • It is arguably easier to use, as it can directly take a single tensor as input and output (covering all time steps), rather than a list of tensors (one per time step). No need to stack, unstack, or transpose.
  • It generates a smaller graph, easier to visualize in TensorBoard.

  5. To handle variable-length input sequences, the simplest option is to set the sequence_length parameter when calling the static_rnn() or dynamic_rnn() functions (see the sketch after this list). Another option is to pad the smaller inputs (e.g., with zeros) to make them the same size as the largest input (this may be faster than the first option if the input sequences all have very similar lengths). To handle variable-length output sequences, if you know in advance the length of each output sequence, you can use the sequence_length parameter (for example, consider a sequence-to-sequence RNN that labels every frame in a video with a violence score: the output sequence will be exactly the same length as the input sequence). If you don't know in advance the length of the output sequence, you can use the padding trick: always output the same size sequence, but ignore any outputs that come after the end-of-sequence token (by ignoring them when computing the cost function).

  6. To distribute training and execution of a deep RNN across multiple GPUs, a common technique is simply to place each layer on a different GPU (see Chapter 12).
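
A minimal sketch of answers 4 and 5, assuming TensorFlow 1.x (the version used in this book): dynamic_rnn() takes a single [batch, time, features] tensor, and its sequence_length argument handles variable-length, zero-padded inputs. Shapes and data are illustrative only.

```python
import numpy as np
import tensorflow as tf   # assumes TensorFlow 1.x, as used in the book

n_steps, n_inputs, n_neurons = 4, 3, 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
seq_length = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

X_batch = np.zeros((2, n_steps, n_inputs), dtype=np.float32)
X_batch[0, :4] = np.random.rand(4, n_inputs)   # full-length sequence
X_batch[1, :2] = np.random.rand(2, n_inputs)   # shorter sequence, zero-padded

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(outputs, feed_dict={X: X_batch, seq_length: [4, 2]})
    print(out.shape)   # (2, 4, 5); outputs beyond each sequence's length are zeros
```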

Exercises

  1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?
  2. Why do people use encoder–decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?
  3. How could you combine a convolutional neural network with an RNN to classify videos?
  4. What are the advantages of building an RNN using dynamic_rnn() rather than static_rnn()?
  5. How can you deal with variable-length input sequences? What about variable-length output sequences?
  6. What is a common way to distribute training and execution of a deep RNN across multiple GPUs?

Exercise Solutions

  1. Here are a few RNN applications:
  • For a sequence-to-sequence RNN: predicting the weather (or any other time series), machine translation (using an encoder–decoder architecture), video captioning, speech to text, music generation (or other sequence generation), identifying the chords of a song.
  • For a sequence-to-vector RNN: classifying music samples by music genre, analyzing the sentiment of a book review, predicting what word an aphasic patient is thinking of based on readings from brain implants, predicting the probability that a user will want to watch a movie based on her watch history (this is one of many possible implementations of collaborative filtering).
  • For a vector-to-sequence RNN: image captioning, creating a music playlist based on an embedding of the current artist, generating a melody based on a set of parameters, locating pedestrians in a picture (e.g., a video frame from a self-driving car’s camera).
  2. In general, if you translate a sentence one word at a time, the result will be terrible. For example, the French sentence “Je vous en prie” means “You are welcome,” but if you translate it one word at a time, you get “I you in pray.” Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an encoder–decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).
  3. To classify videos based on the visual content, one possible architecture could be to take (say) one frame per second, then run each frame through a convolutional neural network, feed the output of the CNN to a sequence-to-vector RNN, and finally run its output through a softmax layer, giving you all the class probabilities. For training you would just use cross entropy as the cost function. If you wanted to use the audio for classification as well, you could convert every second of audio to a spectrograph, feed this spectrograph to a CNN, and feed the output of this CNN to the RNN (along with the corresponding output of the other CNN). A hedged code sketch of this architecture follows this list.
  4. Building an RNN using dynamic_rnn() rather than static_rnn() offers several advantages:
  • It is based on a while_loop() operation that is able to swap the GPU’s memory to the CPU’s memory during backpropagation, avoiding out-of-memory errors.
  • It is arguably easier to use, as it can directly take a single tensor as input and output (covering all time steps), rather than a list of tensors (one per time step). No need to stack, unstack, or transpose.
  • It generates a smaller graph, easier to visualize in TensorBoard.
  5. To handle variable-length input sequences, the simplest option is to set the sequence_length parameter when calling the static_rnn() or dynamic_rnn() functions. Another option is to pad the smaller inputs (e.g., with zeros) to make them the same size as the largest input (this may be faster than the first option if the input sequences all have very similar lengths). To handle variable-length output sequences, if you know in advance the length of each output sequence, you can use the sequence_length parameter (for example, consider a sequence-to-sequence RNN that labels every frame in a video with a violence score: the output sequence will be exactly the same length as the input sequence). If you don’t know in advance the length of the output sequence, you can use the padding trick: always output the same size sequence, but ignore any outputs that come after the end-of-sequence token (by ignoring them when computing the cost function).
  6. To distribute training and execution of a deep RNN across multiple GPUs, a common technique is simply to place each layer on a different GPU (see Chapter 12).
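
One way to turn answer 3 into code is sketched below with tf.keras (my own illustration under assumed frame counts and image sizes, not code from the book): a small CNN is applied to every frame via TimeDistributed, the per-frame feature vectors feed a sequence-to-vector RNN, and a softmax output layer is trained with cross entropy.

```python
import tensorflow as tf

n_frames, height, width, n_classes = 30, 64, 64, 10   # illustrative sizes

frame_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(height, width, 3)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),          # one feature vector per frame
])

model = tf.keras.Sequential([
    tf.keras.layers.TimeDistributed(frame_cnn, input_shape=(n_frames, height, width, 3)),
    tf.keras.layers.LSTM(128),                          # sequence-to-vector RNN
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])                     # cross entropy, as in the answer
model.summary()
```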

Chapter 15: Autoencoders

Exercises

  1. What are the main tasks that autoencoders are used for?
  2. Suppose you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances. How can autoencoders help? How would you proceed?
  3. If an autoencoder perfectly reconstructs the inputs, is it necessarily a good autoencoder? How can you evaluate the performance of an autoencoder?
  4. What are undercomplete and overcomplete autoencoders? What is the main risk of an excessively undercomplete autoencoder? What about the main risk of an overcomplete autoencoder?
  5. How do you tie weights in a stacked autoencoder? What is the point of doing so?
  6. What is a common technique to visualize features learned by the lower layer of a stacked autoencoder? What about higher layers?
  7. What is a generative model? Can you name a type of generative autoencoder?

Exercise Solutions

  1. Here are some of the main tasks that autoencoders are used for:
  • Feature extraction
  • Unsupervised pretraining
  • Dimensionality reduction
  • Generative models
  • Anomaly detection (an autoencoder is generally bad at reconstructing outliers)
  2. If you want to train a classifier and you have plenty of unlabeled training data but only a few thousand labeled instances, you could first train a deep autoencoder on the full dataset (labeled + unlabeled), then reuse its lower half for the classifier (i.e., reuse the layers up to and including the codings layer) and train the classifier using the labeled data. If you have little labeled data, you probably want to freeze the reused layers when training the classifier.

  3. The fact that an autoencoder perfectly reconstructs its inputs does not necessarily mean that it is a good autoencoder; perhaps it is simply an overcomplete autoencoder that learned to copy its inputs to the codings layer and then to the outputs. In fact, even if the codings layer contained a single neuron, it would be possible for a very deep autoencoder to learn to map each training instance to a different coding (e.g., the first instance could be mapped to 0.001, the second to 0.002, the third to 0.003, and so on), and it could learn "by heart" to reconstruct the right training instance for each coding. It would perfectly reconstruct its inputs without really learning any useful pattern in the data. In practice such a mapping is unlikely to happen, but it illustrates the fact that perfect reconstructions are not a guarantee that the autoencoder learned anything useful. However, if it produces very bad reconstructions, then it is almost guaranteed to be a bad autoencoder. To evaluate the performance of an autoencoder, one option is to measure the reconstruction loss (e.g., compute the MSE, the mean square of the outputs minus the inputs). Again, a high reconstruction loss is a good sign that the autoencoder is bad, but a low reconstruction loss is not a guarantee that it is good. You should also evaluate the autoencoder according to what it will be used for. For example, if you are using it for unsupervised pretraining of a classifier, then you should also evaluate the classifier's performance.

  4. An undercomplete autoencoder is one whose codings layer is smaller than the input and output layers. If it is larger, then it is an overcomplete autoencoder. The main risk of an excessively undercomplete autoencoder is that it may fail to reconstruct the inputs. The main risk of an overcomplete autoencoder is that it may just copy the inputs to the outputs, without learning any useful feature.

  5. To tie the weights of an encoder layer and its corresponding decoder layer, you simply make the decoder weights equal to the transpose of the encoder weights (see the sketch after this list). This reduces the number of parameters in the model by half, often making training converge faster with less training data, and reducing the risk of overfitting the training set.

  6. To visualize the features learned by the lower layer of a stacked autoencoder, a common technique is simply to plot the weights of each neuron, by reshaping each weight vector to the size of an input image (e.g., for MNIST, reshaping a weight vector of shape [784] to [28, 28]). To visualize the features learned by higher layers, one technique is to display the training instances that most activate each neuron.

  7. A generative model is a model capable of randomly generating outputs that resemble the training instances. For example, once trained successfully on the MNIST dataset, a generative model can be used to randomly generate realistic images of digits. The output distribution is typically similar to the training data. For example, since MNIST contains many images of each digit, the generative model would output roughly the same number of images of each digit. Some generative models can be parametrized, for example to generate only some kinds of outputs. An example of a generative autoencoder is the variational autoencoder.
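
Answer 5 ("tying weights") in code: a sketch in the TensorFlow 1.x style used in the book's Chapter 15, where each decoder weight matrix is simply the transpose of the corresponding encoder weight matrix, so only the decoder biases remain independent. Layer sizes are illustrative.

```python
import tensorflow as tf   # TensorFlow 1.x style, as used in the book

n_inputs, n_hidden1, n_codings = 784, 300, 150

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

w1 = tf.Variable(tf.truncated_normal([n_inputs, n_hidden1], stddev=0.1), name="w1")
w2 = tf.Variable(tf.truncated_normal([n_hidden1, n_codings], stddev=0.1), name="w2")
w3 = tf.transpose(w2, name="w3")   # tied weights: decoder reuses the encoder weights
w4 = tf.transpose(w1, name="w4")   # tied weights

b1 = tf.Variable(tf.zeros(n_hidden1))
b2 = tf.Variable(tf.zeros(n_codings))
b3 = tf.Variable(tf.zeros(n_hidden1))
b4 = tf.Variable(tf.zeros(n_inputs))

hidden1 = tf.nn.elu(tf.matmul(X, w1) + b1)
codings = tf.nn.elu(tf.matmul(hidden1, w2) + b2)
hidden3 = tf.nn.elu(tf.matmul(codings, w3) + b3)
outputs = tf.matmul(hidden3, w4) + b4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))      # MSE (see answer 3)
training_op = tf.train.AdamOptimizer(0.01).minimize(reconstruction_loss)
```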


Exercises

  1. What are the main tasks that autoencoders are used for?
  2. Suppose you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances. How can autoencoders help? How would you proceed?
  3. If an autoencoder perfectly reconstructs the inputs, is it necessarily a good autoencoder? How can you evaluate the performance of an autoencoder?
  4. What are undercomplete and overcomplete autoencoders? What is the main risk of an excessively undercomplete autoencoder? What about the main risk of an overcomplete autoencoder?
  5. How do you tie weights in a stacked autoencoder? What is the point of doing so?
  6. What is a common technique to visualize features learned by the lower layer of a stacked autoencoder? What about higher layers?
  7. What is a generative model? Can you name a type of generative autoencoder?

Exercise Solutions

  1. Here are some of the main tasks that autoencoders are used for:
  • Feature extraction
  • Unsupervised pretraining
  • Dimensionality reduction
  • Generative models
  • Anomaly detection (an autoencoder is generally bad at reconstructing outliers)
  2. If you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances, then you could first train a deep autoencoder on the full dataset (labeled + unlabeled), then reuse its lower half for the classifier (i.e., reuse the layers up to the codings layer, included) and train the classifier using the labeled data. If you have little labeled data, you probably want to freeze the reused layers when training the classifier (see the sketch after this list).

  3. The fact that an autoencoder perfectly reconstructs its inputs does not necessarily mean that it is a good autoencoder; perhaps it is simply an overcomplete autoencoder that learned to copy its inputs to the codings layer and then to the outputs. In fact, even if the codings layer contained a single neuron, it would be possible for a very deep autoencoder to learn to map each training instance to a different coding (e.g., the first instance could be mapped to 0.001, the second to 0.002, the third to 0.003, and so on), and it could learn “by heart” to reconstruct the right training instance for each coding. It would perfectly reconstruct its inputs without really learning any useful pattern in the data. In practice such a mapping is unlikely to happen, but it illustrates the fact that perfect reconstructions are not a guarantee that the autoencoder learned anything useful. However, if it produces very bad reconstructions, then it is almost guaranteed to be a bad autoencoder. To evaluate the performance of an autoencoder, one option is to measure the reconstruction loss (e.g., compute the MSE, the mean square of the outputs minus the inputs). Again, a high reconstruction loss is a good sign that the autoencoder is bad, but a low reconstruction loss is not a guarantee that it is good. You should also evaluate the autoencoder according to what it will be used for. For example, if you are using it for unsupervised pretraining of a classifier, then you should also evaluate the classifier’s performance.

  4. An undercomplete autoencoder is one whose codings layer is smaller than the input and output layers. If it is larger, then it is an overcomplete autoencoder. The main risk of an excessively undercomplete autoencoder is that it may fail to reconstruct the inputs. The main risk of an overcomplete autoencoder is that it may just copy the inputs to the outputs, without learning any useful feature.
  5. To tie the weights of an encoder layer and its corresponding decoder layer, you simply make the decoder weights equal to the transpose of the encoder weights. This reduces the number of parameters in the model by half, often making training converge faster with less training data, and reducing the risk of overfitting the training set.
  6. To visualize the features learned by the lower layer of a stacked autoencoder, a common technique is simply to plot the weights of each neuron, by reshaping each weight vector to the size of an input image (e.g., for MNIST, reshaping a weight vector of shape [784] to [28, 28]). To visualize the features learned by higher layers, one technique is to display the training instances that most activate each neuron.
  7. A generative model is a model capable of randomly generating outputs that resemble the training instances. For example, once trained successfully on the MNIST dataset, a generative model can be used to randomly generate realistic images of digits. The output distribution is typically similar to the training data. For example, since MNIST contains many images of each digit, the generative model would output roughly the same number of images of each digit. Some generative models can be parametrized—for example, to generate only some kinds of outputs. An example of a generative autoencoder is the variational autoencoder.
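
A hedged tf.keras sketch of answer 2 (my own illustration; the dataset variables X_all, X_labeled, and y_labeled are placeholders, not defined here): train an autoencoder on all the data, then reuse its encoder, frozen, as the lower layers of a classifier trained on the small labeled set.

```python
import tensorflow as tf

n_inputs, n_codings, n_classes = 784, 64, 10

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_inputs,)),
    tf.keras.layers.Dense(n_codings, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_codings,)),
    tf.keras.layers.Dense(n_inputs, activation="sigmoid"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(X_all, X_all, epochs=10)          # train on all data, labeled + unlabeled

encoder.trainable = False                            # freeze the reused layers (few labels)
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
classifier.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                   metrics=["accuracy"])
# classifier.fit(X_labeled, y_labeled, epochs=10)    # train the classifier on the labeled data
```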

Chapter 16: Reinforcement Learning

Exercises

  1. How would you define Reinforcement Learning? How is it different from regular supervised or unsupervised learning?
  2. Can you think of three possible applications of RL that were not mentioned in this chapter? For each of them, what is the environment? What is the agent? What are the possible actions? What are the rewards?
  3. What is the discount rate? Can the optimal policy change if you modify the discount rate?
  4. How do you measure the performance of a Reinforcement Learning agent?
  5. What is the credit assignment problem? When does it occur? How can you alleviate it?
  6. What is the point of using a replay memory?
  7. What is an off-policy RL algorithm?

Exercise Solutions

  1. Reinforcement Learning is an area of Machine Learning aimed at creating agents capable of taking actions in an environment in a way that maximizes rewards over time. There are many differences between RL and regular supervised and unsupervised learning. Here are a few:
  • In supervised and unsupervised learning, the goal is generally to find patterns in the data. In Reinforcement Learning, the goal is to find a good policy.
  • Unlike in supervised learning, the agent is not explicitly given the "right" answer. It must learn by trial and error.
  • Unlike in unsupervised learning, there is a form of supervision, through rewards. We do not tell the agent how to perform the task, but we do tell it when it is making progress or when it is failing.
  • A Reinforcement Learning agent needs to find the right balance between exploring the environment, looking for new ways of getting rewards, and exploiting sources of rewards that it already knows. In contrast, supervised and unsupervised learning systems generally don't need to worry about exploration; they just feed on the training data they are given.
  • In supervised and unsupervised learning, training instances are typically independent (in fact, they are generally shuffled). In Reinforcement Learning, consecutive observations are generally not independent. An agent may remain in the same region of the environment for a while before it moves on, so consecutive observations will be very correlated. In some cases a replay memory is used to ensure that the training algorithm gets fairly independent observations.
  2. Here are a few possible applications of Reinforcement Learning, other than those mentioned in Chapter 16:

Music personalization

The environment is a user's personalized web radio. The agent is the software deciding what song to play next for that user. Its possible actions are to play any song in the catalog (it must try to choose a song the user will enjoy) or to play an advertisement (it must try to choose an ad the user will be interested in). It gets a small reward every time the user listens to a song, a larger reward every time the user listens to an ad, a negative reward when the user skips a song or an ad, and a very negative reward if the user leaves.

Marketing

The environment is your company's marketing department. The agent is the software that defines which customers a mailing campaign should be sent to, given their profile and purchase history (for each customer it has two possible actions: send or don't send). It gets a negative reward for the cost of the mailing campaign, and a positive reward for the estimated revenue generated from this campaign.

Product delivery

Let the agent control a fleet of delivery trucks, deciding what they should pick up at the depots, where they should go, what they should drop off, and so on. They would get positive rewards for each product delivered on time, and negative rewards for late deliveries.

  3. When estimating the value of an action, Reinforcement Learning algorithms typically sum all the rewards that this action led to, giving more weight to immediate rewards and less weight to later rewards (considering that an action has more influence on the near future than on the distant future). To model this, a discount rate is typically applied at each time step. For example, with a discount rate of 0.9, a reward of 100 that is received two time steps later is counted as only 0.9² × 100 = 81 when you are estimating the value of the action (see the sketch after this list). You can think of the discount rate as a measure of how much the future is valued relative to the present: if it is very close to 1, then the future is valued almost as much as the present; if it is close to 0, then only immediate rewards matter. Of course, this impacts the optimal policy tremendously: if you value the future, you may be willing to put up with a lot of immediate pain for the prospect of eventual rewards, while if you don't value the future, you will just grab any immediate reward you can find, never investing in the future.

  4. To measure the performance of a Reinforcement Learning agent, you can simply sum up the rewards it gets. In a simulated environment, you can run many episodes and look at the total rewards it gets on average (and possibly look at the min, max, standard deviation, and so on).

  5. The credit assignment problem is the fact that when a Reinforcement Learning agent receives a reward, it has no direct way of knowing which of its previous actions contributed to this reward. It typically occurs when there is a large delay between an action and the resulting rewards (e.g., during a game of Atari's Pong, there may be a few dozen time steps between the moment the agent hits the ball and the moment it wins the point). One way to alleviate it is to provide the agent with shorter-term rewards, when possible. This usually requires prior knowledge about the task. For example, if we want to build an agent that will learn to play chess, instead of giving it a reward only when it wins the game, we could give it a reward every time it captures one of the opponent's pieces.

  6. An agent can often remain in the same region of its environment for a while, so all of its experiences will be very similar for that period of time. This can introduce some bias in the learning algorithm. It may tune its policy for this region of the environment, but it will not perform well as soon as it moves out of this region. To solve this problem, you can use a replay memory; instead of using only the most immediate experiences for learning, the agent will learn based on a buffer of its past experiences, recent and not so recent (perhaps this is why we dream at night: to replay our experiences of the day and better learn from them?).

  7. An off-policy RL algorithm learns the value of the optimal policy (i.e., the sum of discounted rewards that can be expected for each state if the agent acts optimally), independently of how the agent actually acts. Q-Learning is a good example of such an algorithm. In contrast, an on-policy algorithm learns the value of the policy that the agent actually executes, including both exploration and exploitation.
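
The discount-rate arithmetic from answer 3, in a few lines of plain Python (illustrative only): with a discount rate of 0.9, a reward of 100 received two steps in the future is worth 0.9² × 100 = 81 today.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma**t (t = number of steps in the future)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(0.9 ** 2 * 100)                        # 81.0 (the example from the answer above)
print(discounted_return([0, 0, 100]))        # also 81.0
print(discounted_return([10, 10, 10], 0.0))  # 10.0: gamma near 0 -> only immediate reward matters
```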

Exercises

  1. How would you define Reinforcement Learning? How is it different from regular supervised or unsupervised learning?
  2. Can you think of three possible applications of RL that were not mentioned in this chapter? For each of them, what is the environment? What is the agent? What are possible actions? What are the rewards?
  3. What is the discount rate? Can the optimal policy change if you modify the discount rate?
  4. How do you measure the performance of a Reinforcement Learning agent?
  5. What is the credit assignment problem? When does it occur? How can you alleviate it?
  6. What is the point of using a replay memory?
  7. What is an off-policy RL algorithm?

Exercise Solutions

  1. Reinforcement Learning is an area of Machine Learning aimed at creating agents capable of taking actions in an environment in a way that maximizes rewards over time. There are many differences between RL and regular supervised and unsupervised learning. Here are a few:
  • In supervised and unsupervised learning, the goal is generally to find patterns in the data. In Reinforcement Learning, the goal is to find a good policy.
  • Unlike in supervised learning, the agent is not explicitly given the “right” answer. It must learn by trial and error.
  • Unlike in unsupervised learning, there is a form of supervision, through rewards. We do not tell the agent how to perform the task, but we do tell it when it is making progress or when it is failing.
  • A Reinforcement Learning agent needs to find the right balance between exploring the environment, looking for new ways of getting rewards, and exploiting sources of rewards that it already knows. In contrast, supervised and unsupervised learning systems generally don’t need to worry about exploration; they just feed on the training data they are given.
  • In supervised and unsupervised learning, training instances are typically independent (in fact, they are generally shuffled). In Reinforcement Learning, consecutive observations are generally not independent. An agent may remain in the same region of the environment for a while before it moves on, so consecutive observations will be very correlated. In some cases a replay memory is used to ensure that the training algorithm gets fairly independent observations.
  2. Here are a few possible applications of Reinforcement Learning, other than those mentioned in Chapter 16:

Music personalization

The environment is a user’s personalized web radio. The agent is the software deciding what song to play next for that user. Its possible actions are to play any song in the catalog (it must try to choose a song the user will enjoy) or to play an advertisement (it must try to choose an ad that the user will be interested in). It gets a small reward every time the user listens to a song, a larger reward every time the user listens to an ad, a negative reward when the user skips a song or an ad, and a very negative reward if the user leaves.

Marketing

The environment is your company’s marketing department. The agent is the software that defines which customers a mailing campaign should be sent to, given their profile and purchase history (for each customer it has two possible actions: send or don’t send). It gets a negative reward for the cost of the mailing campaign, and a positive reward for estimated revenue generated from this campaign.

Product delivery

Let the agent control a fleet of delivery trucks, deciding what they should pick up at the depots, where they should go, what they should drop off, and so on. They would get positive rewards for each product delivered on time, and negative rewards for late deliveries.

  3. When estimating the value of an action, Reinforcement Learning algorithms typically sum all the rewards that this action led to, giving more weight to immediate rewards, and less weight to later rewards (considering that an action has more influence on the near future than on the distant future). To model this, a discount rate is typically applied at each time step. For example, with a discount rate of 0.9, a reward of 100 that is received two time steps later is counted as only 0.9² × 100 = 81 when you are estimating the value of the action. You can think of the discount rate as a measure of how much the future is valued relative to the present: if it is very close to 1, then the future is valued almost as much as the present. If it is close to 0, then only immediate rewards matter. Of course, this impacts the optimal policy tremendously: if you value the future, you may be willing to put up with a lot of immediate pain for the prospect of eventual rewards, while if you don’t value the future, you will just grab any immediate reward you can find, never investing in the future.

  4. To measure the performance of a Reinforcement Learning agent, you can simply sum up the rewards it gets. In a simulated environment, you can run many episodes and look at the total rewards it gets on average (and possibly look at the min, max, standard deviation, and so on).

  5. The credit assignment problem is the fact that when a Reinforcement Learning agent receives a reward, it has no direct way of knowing which of its previous actions contributed to this reward. It typically occurs when there is a large delay between an action and the resulting rewards (e.g., during a game of Atari’s Pong, there may be a few dozen time steps between the moment the agent hits the ball and the moment it wins the point). One way to alleviate it is to provide the agent with shorter-term rewards, when possible. This usually requires prior knowledge about the task. For example, if we want to build an agent that will learn to play chess, instead of giving it a reward only when it wins the game, we could give it a reward every time it captures one of the opponent’s pieces.

  6. An agent can often remain in the same region of its environment for a while, so all of its experiences will be very similar for that period of time. This can introduce some bias in the learning algorithm. It may tune its policy for this region of the environment, but it will not perform well as soon as it moves out of this region. To solve this problem, you can use a replay memory; instead of using only the most immediate experiences for learning, the agent will learn based on a buffer of its past experiences, recent and not so recent (perhaps this is why we dream at night: to replay our experiences of the day and better learn from them?).

  7. An off-policy RL algorithm learns the value of the optimal policy (i.e., the sum of discounted rewards that can be expected for each state if the agent acts optimally), independently of how the agent actually acts. Q-Learning is a good example of such an algorithm. In contrast, an on-policy algorithm learns the value of the policy that the agent actually executes, including both exploration and exploitation. (A minimal Q-Learning sketch follows below.)
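
To ground answers 6 and 7, here is a minimal tabular Q-Learning sketch with a small replay memory, written against a made-up five-state chain environment (my illustration, not code from the book). Q-Learning is off-policy: the update target uses the max over next actions, regardless of which action the exploratory behavior policy takes next.

```python
import random
from collections import deque

import numpy as np

n_states, n_actions = 5, 2          # toy chain: action 1 moves right, action 0 moves left
gamma, alpha, n_steps = 0.9, 0.1, 5000

def step(state, action):
    """Made-up environment: reward 1 only when reaching the right end of the chain."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

Q = np.zeros((n_states, n_actions))
replay = deque(maxlen=200)          # replay memory: decorrelates consecutive experiences
state = 0
for _ in range(n_steps):
    action = random.randrange(n_actions)            # exploratory behavior policy
    next_state, reward = step(state, action)
    replay.append((state, action, reward, next_state))
    s, a, r, s2 = random.choice(replay)             # learn from a randomly chosen past experience
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # off-policy update target
    state = next_state if next_state != n_states - 1 else 0

print(np.argmax(Q, axis=1))   # learned greedy policy: should prefer moving right
```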
